Generic chaining and the $\ell_1$-penalty

Sara van de Geer 






Abstract We address the choice of the tuning parameter $\lambda$ in $\ell_1$-penalized
M-estimation. Our main concern is models which are highly nonlinear, such
as the Gaussian mixture model. The number of parameters $p$ is moreover
large, possibly larger than the number of observations $n$. The generic chaining
technique of Talagrand [2005] is tailored for this problem. It leads to the choice
$\lambda \asymp \sqrt{\log p / n}$, as in the standard Lasso procedure (which concerns the linear
model and least squares loss).
Let $X_1, \ldots, X_n$ be independent observations with values in some observation
space $\mathcal{X}$, and let for $\theta$ in a parameter space $\Theta \subset \mathbb{R}^p$ be given a loss function
$\rho_\theta : \mathcal{X} \to \mathbb{R}$. The parameter $\theta$ is potentially high-dimensional, i.e. possibly
$p > n$. In this article we study the $\ell_1$-regularized M-estimator

$$\hat\theta := \arg\min_{\theta \in \Theta} \left\{ P_n \rho_\theta + \lambda \|\theta\|_1 \right\}.$$

Here, we use the notation $P_n \rho_\theta := \sum_{i=1}^n \rho_\theta(X_i)/n$, i.e., it is the empirical
measure of the loss function $\rho_\theta$, often referred to as the empirical risk. Moreover,
$\lambda > 0$ is a tuning parameter and $\|\theta\|_1 := \sum_{j=1}^p |\theta_j|$ is the $\ell_1$-norm of $\theta$.

A special case is the Lasso (Tibshirani [1996]), which has quadratic loss:

$$\rho_\theta(X) := (Y - \theta^T Z)^2, \quad X = (Y, Z),$$

where $Y \in \mathbb{R}$ is the response variable and $Z \in \mathbb{R}^p$ are covariables. There
are many papers on the Lasso, see for example van de Geer [2001], Bunea et al.
[2006], Bunea et al. [2007a], Bunea et al. [2007b], van de Geer [2008], Koltchinskii
[2009]. For an overview and further results, see also
Bühlmann and van de Geer [2011]. It is known that generally the choice
$\lambda \asymp \sqrt{\log p / n}$ is appropriate. Under some distributional assumptions, this choice
leads to favorable theoretical properties of the Lasso, such as good oracle bounds
for the estimation and prediction error.
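To make the role of the tuning parameter concrete, the following is a minimal sketch of the Lasso objective $P_n \rho_\theta + \lambda\|\theta\|_1$ with quadratic loss, minimized by cyclic coordinate descent. The solver, the function names and the constant `c` in `default_lambda` are illustrative assumptions and are not taken from the paper; only the order $\sqrt{\log p / n}$ of $\lambda$ comes from the text.

```python
import math


def soft_threshold(x, t):
    """Soft-thresholding operator S(x, t) = sign(x) * max(|x| - t, 0)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso_cd(Z, Y, lam, n_sweeps=200):
    """Minimize (1/n) * sum_i (Y_i - theta^T Z_i)^2 + lam * ||theta||_1
    by cyclic coordinate descent. Z is a list of n rows, each of length p."""
    n, p = len(Z), len(Z[0])
    theta = [0.0] * p
    resid = list(Y)  # residuals Y - Z theta at the starting point theta = 0
    col_sq = [sum(Z[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # (1/n) * <column j, partial residual that excludes coordinate j>
            b = sum(Z[i][j] * (resid[i] + Z[i][j] * theta[j]) for i in range(n)) / n
            new = soft_threshold(b, lam / 2.0) / col_sq[j]
            delta = new - theta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Z[i][j] * delta
                theta[j] = new
    return theta


def default_lambda(n, p, c=1.0):
    """Tuning parameter of order sqrt(log p / n); the constant c is an assumption."""
    return c * math.sqrt(math.log(p) / n)
```

For instance, `lasso_cd(Z, Y, default_lambda(len(Z), len(Z[0])))` would fit the estimator with a theoretical choice of $\lambda$; in practice the constant would typically be calibrated, for example by cross-validation.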



In this note we address the following question: is the choice $\lambda \asymp \sqrt{\log p / n}$ also
appropriate for non-linear situations? The example described above is a linear
situation. More generally, we call the situation linear if for some $\psi : \mathcal{X} \to \mathbb{R}^p$

$$(P_n - P)(\rho_\theta - \rho_{\tilde\theta}) = (\theta - \tilde\theta)^T (P_n - P)\psi, \quad \forall\ \theta, \tilde\theta,$$

where $P\rho_\theta := \frac{1}{n}\sum_{i=1}^n E\rho_\theta(X_i)$ is the theoretical risk. Any generalized linear
model (GLM) loss function with canonical link function and fixed design
is a linear situation. Also density estimation using an exponential family is
a linear situation. A non-linear situation occurs for instance in linear least
squares regression with random design. Our focus is more on other examples,
such as mixture models (Städler and van de Geer [2010]) or mixed effect models
(Schelldorfer et al. [2011]), etc.



Let us define the "true" parameter

$$\theta^0 := \arg\min_{\theta \in \bar\Theta} P\rho_\theta,$$

where we assume $\rho_\theta$ is defined for all $\theta$ in the possibly extended space $\bar\Theta \supset \Theta$.
Let $\theta^* \in \Theta$ be some "approximation" of $\theta^0$. Here, we have in mind the best
approximation within $\Theta$ (in the case of misspecified models), and possibly the
best "sparse" approximation (see Remark 2.1 for a definition). Our choice for
the tuning parameter is governed by the behavior over $\ell_1$-balls
$\Theta_M(\theta^*) := \{\theta \in \Theta^* : \|\theta - \theta^*\|_1 \le M\}$ of the empirical process $(P_n - P)(\rho_\theta - \rho_{\theta^*})$, where
$\Theta^* = \Theta$ or, in the case of Theorem 2.2 (convex loss), $\Theta^*$ is the smallest convex
set containing $\Theta$.

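In the linear case, the supremum of the empirical process can be easily bounded
using the dual norm inequality

$$\sup_{\theta \in \Theta_M(\theta^*)} |(P_n - P)(\rho_\theta - \rho_{\theta^*})| \le \|(P_n - P)\psi\|_\infty M, \qquad (1)$$

where for a vector $v \in \mathbb{R}^p$, $\|v\|_\infty := \max_{1 \le j \le p} |v_j|$ is the uniform norm. Moreover,
for example for $\mathcal{N}(0, 1/n)$-random variables $\{V_j\}_{j=1}^p$ (say), it holds that

$$\max_{1 \le j \le p} |V_j| = O_P\left(\sqrt{\frac{\log p}{n}}\right).$$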
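The $\sqrt{\log p / n}$ rate for the maximum of $p$ independent $\mathcal{N}(0, 1/n)$ variables can be checked numerically. The sketch below is a small Monte Carlo experiment; the function names, the number of repetitions and the seed are illustrative assumptions. It compares the simulated mean of $\max_j |V_j|$ with the standard bound $\sqrt{2\log(2p)/n}$.

```python
import math
import random


def mean_max_abs_gaussian(p, n, reps=200, seed=0):
    """Monte Carlo estimate of E max_j |V_j| for V_1, ..., V_p i.i.d. N(0, 1/n)."""
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(n)
    total = 0.0
    for _ in range(reps):
        total += max(abs(rng.gauss(0.0, sd)) for _ in range(p))
    return total / reps


def gaussian_max_bound(p, n):
    """Standard bound E max_j |V_j| <= sqrt(2 log(2p) / n)."""
    return math.sqrt(2.0 * math.log(2 * p) / n)
```

Running `mean_max_abs_gaussian(p, n)` for increasing `p` at fixed `n` should show growth proportional to $\sqrt{\log p}$, staying below `gaussian_max_bound(p, n)`.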



We show in this paper that in many non-linear cases, one still has

$$\sup_{\theta \in \Theta_M(\theta^*)} |(P_n - P)(\rho_\theta - \rho_{\theta^*})| = O_P\left(\sqrt{\frac{\log p}{n}}\right) M. \qquad (2)$$

This follows rather easily from a generic chaining (Talagrand [1996]) and Sudakov
minoration argument. We will use the book Talagrand [2005].

In the case of regression with robust GLM loss (robust quasi-likelihood loss
functions, quantile functions), we have

$$\rho_\theta(X) = \rho(Y, \theta^T Z),$$

with $\rho(y, \cdot)$ Lipschitz for all $y$. In that situation, one may apply the contraction
inequality (Ledoux and Talagrand [1991]) to arrive at (2). We will explain this
in the subsection on generalized linear functions.



Our emphasis is however on cases that go beyond GLM loss. An example is the
Gaussian mixture model

$$\rho_\theta(Y, Z) = -\log\left( \sum_{k=1}^r \pi_k \phi_{\sigma_k}(Y - \beta_k^T Z_k) \right),$$

where $\phi_\sigma = \phi(\cdot/\sigma)/\sigma$ is the density of the $\mathcal{N}(0, \sigma^2)$-distribution, $\{\pi_k\}_{k=1}^r$ are
mixing coefficients ($\sum_{k=1}^r \pi_k = 1$), $\beta_k$ and $Z_k$ are vectors in $\mathbb{R}^{p_k}$, $k = 1, \ldots, r$,
and where $Y$ is again a response variable and $Z^T := (Z_1^T, \ldots, Z_r^T)$ a covariable.
This model has been studied in Städler and van de Geer [2010]. The tuning
parameter is there taken of order $\lambda \asymp \sqrt{\log^3 n \, \log(p \vee n)/n}$. The parameters
in this model are $\theta := (\pi, \sigma, \beta_1, \ldots, \beta_r)$ (and in Städler and van de Geer [2010],
the penalty is $\lambda\|\beta\|_1 = \lambda \sum_{k=1}^r \|\beta_k\|_1$, i.e., it does not include the parameters
$\pi$ and $\sigma$). The model is definitely non-linear. It is essentially a GLM albeit
that there are $r$ linear functions involved instead of just one, and there are
the further parameters $\pi$ and $\sigma$. We call such a model an extended GLM. The
contraction inequality will not help us anymore in this case, but as we will see
in Subsection 4.3 the generic chaining argument gives a multivariate version of
the contraction theorem. This leads to the reduced choice $\lambda \asymp \sqrt{\log p / n}$.
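As an illustration of why this loss is non-linear in the parameters, the sketch below evaluates the mixture loss for a single observation, assuming it is the negative log-likelihood with one standard deviation $\sigma_k$ per component. The function names and the block-wise data layout (`z_blocks[k]` and `beta_blocks[k]` of length $p_k$) are hypothetical choices made for this example only.

```python
import math


def normal_density(x, sigma):
    """Density of N(0, sigma^2) at x, i.e. phi(x / sigma) / sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_loss(y, z_blocks, pi, sigma, beta_blocks):
    """Negative log-likelihood of one observation (y, z) in an r-component
    Gaussian mixture regression (assumed form of the loss)."""
    lik = 0.0
    for pi_k, sigma_k, z_k, beta_k in zip(pi, sigma, z_blocks, beta_blocks):
        mean_k = sum(b * z for b, z in zip(beta_k, z_k))  # linear function k
        lik += pi_k * normal_density(y - mean_k, sigma_k)
    return -math.log(lik)


def l1_penalty(beta_blocks, lam):
    """Penalty lam * sum_k ||beta_k||_1; pi and sigma are left unpenalized."""
    return lam * sum(abs(b) for beta_k in beta_blocks for b in beta_k)
```

The loss depends on the parameters through $r$ linear functions $\beta_k^T Z_k$ at once, which is the structure the extended GLM condition is meant to capture.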

Another situation is where $\rho_\theta$ is a general non-linear function. In that case, we
will restrict ourselves to the medium-dimensional situation with $p$ sufficiently
smaller than $n$. Again the generic chaining bound can be used.

Our results rely on the following condition. 

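Condition 1.1 (Componentwise Lipschitz condition) There exist functions $\{\psi_j\}$
($\psi_j : \mathcal{X} \times \{1, \ldots, n\} \to \mathbb{R}$) and constants $\{c_{i,\theta}\}$ such that for all $\theta$ and $\tilde\theta$ in $\Theta^*$

$$\left| [\rho_\theta(X_i) - c_{i,\theta}] - [\rho_{\tilde\theta}(X_i) - c_{i,\tilde\theta}] \right| \le \sum_{j=1}^p |\theta_j - \tilde\theta_j| \, |\psi_j(X_i, i)|, \quad \forall\ i.$$

The constants $\{c_{i,\theta}\}$ will generally be either all zero, or equal to the expectation
$c_{i,\theta} = E\rho_\theta(X_i)$.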

Generic chaining gives a bound $\gamma_2$ (following the notation of Talagrand [2005])
for the supremum of stochastic processes (see Theorem 5.1). This bound $\gamma_2$
is defined by the geometry of the index set of the process. By Sudakov's
minoration, $\gamma_2$ is also a lower bound in the case of Gaussian processes. This is
the argument we will use. It means that we need not directly calculate $\gamma_2$ but
instead obtain an upper bound for free. Nevertheless, it would be of interest
to directly bound $\gamma_2$ using geometric arguments (Talagrand's research problem
2.1.9 in Talagrand [2005]). The Dudley bound (see Dudley [1967] or Dudley [...])
results in additional (and hence superfluous) $(\log n)$-factors (see Section 5).

We remark that the bounds are based on arguments for Gaussian processes, and
in fact on the behavior of maxima of i.i.d. Gaussians. This is so to speak the
worst case: the bounds are here the largest. In particular for random variables
which are highly dependent, one may have smaller bounds. Moreover, in the
statistical application of $\ell_1$-regularized estimation, strong dependencies may
lead to choosing the tuning parameter $\lambda$ of much smaller order than $\sqrt{\log p / n}$.
This is explained in van de Geer and Lederer [2012] for the case of the Lasso.
It means that even when the result

$$\sup_{\theta \in \Theta_M(\theta^*)} |(P_n - P)(\rho_\theta - \rho_{\theta^*})| = O_P\left(\sqrt{\frac{\log p}{n}}\right) M$$

leaves no room for improvement, there are situations where the choice
$\lambda \asymp \sqrt{\log p / n}$ is much too large. We will not address this issue here but refer to
van de Geer and Lederer [2012].



That generic chaining arguments can be used to theoretically show that
$\lambda \asymp \sqrt{\log p / n}$ is appropriate is perhaps of little practical value. One may argue
for example that cross-validation will rather be used in practice, instead of a
theoretical value. Our finding is primarily interesting from a theoretical point
of view.

Generic chaining plays an important role in the statistics literature, for example
in empirical risk minimization (Bartlett and Mendelson [2006]), PAC-Bayesian
learning (Audibert and Bousquet [2007]), and the Lasso with random design
(Bartlett et al. [2009]). We believe the application in this paper, addressing
the choice of the tuning parameter $\lambda$ in $\ell_1$-regularization for M-estimators, is a
nice opportunity to clearly demonstrate the elegance of Talagrand's approach.



1.1 Organization of the paper 

In Section 2 we review the basic oracle inequality for the $\ell_1$-penalized
M-estimator. The purpose of this section is to highlight the role of the supremum

$$\sup_{\theta \in \Theta_M(\theta^*)} |(P_n - P)(\rho_\theta - \rho_{\theta^*})|.$$

The proofs of Theorems 2.1 and 2.2 follow closely Bühlmann and van de Geer
[2011], and are given for completeness in Section 7. In Section 4 we show that

$$E\left( \sup_{\theta \in \Theta_M(\theta^*)} |P_n^\varepsilon(\rho^c_\theta - \rho^c_{\theta^*})| \ \Big|\ \mathbf{X} \right) \lesssim \sqrt{\frac{\log p}{n}}\, M.$$

Here, $P_n^\varepsilon$ is the symmetrized measure defined in Section 3 and $\mathbf{X} := (X_1, \ldots, X_n)$.
Moreover, $\rho^c_\theta(X_i, i) = \rho_\theta(X_i) - c_{i,\theta}$, with the constants $c_{i,\theta}$ as in Condition 1.1.
Section 3 summarizes why bounds on the conditional mean of the symmetrized
process suffice: they lead to exponential probability inequalities using a
deviation inequality of Massart [2000a]. Section 5 gives the details concerning generic
chaining and a consequence concerning the geometry of $\ell_1$-balls. It summarizes
some results in Talagrand [2005] and makes a comparison with Dudley's entropy
bound.



2 The oracle inequality 



We let for $\theta$ and $\theta^*$ in $\Theta$,

$$Y(\theta, \theta^*) := (P_n - P)(\rho_\theta - \rho_{\theta^*}).$$

In this section, we show why bounds for $\sup_{\theta \in \Theta_M(\theta^*)} |Y(\theta, \theta^*)|$ can be used
to choose the tuning parameter $\lambda$ and arrive at an oracle inequality for the
$\ell_1$-regularized M-estimator $\hat\theta$. The line of reasoning is as in Bühlmann and van de Geer
[2011]. Define the excess risk

$$\mathcal{E}(\theta; \theta^0) := P(\rho_\theta - \rho_{\theta^0}).$$

The following condition quantifies the curvature of $\mathcal{E}(\theta; \theta^0)$ around its minimizer
$\theta^0$.

Condition 2.1 (Margin condition) We say that the margin condition holds for
all $\theta \in \Theta_M(\theta^*)$ if for some norm $\tau$ on $\Theta$, and some strictly convex non-negative
function $G$, satisfying $G(0) = 0$,

$$\mathcal{E}(\theta; \theta^0) \ge G(\tau(\theta - \theta^0)), \quad \forall\ \theta \in \Theta_M(\theta^*).$$

Definition 2.1 (Convex conjugate) Let $G$ be a strictly convex non-negative
function with $G(0) = 0$. The convex conjugate of $G$ is

$$H(v) := \sup_{u \ge 0} \{ uv - G(u) \}, \quad v \ge 0.$$
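As a quick illustration of the convex conjugate, consider a quadratic margin $G(u) = c u^2$, for which $H(v) = v^2/(4c)$. The sketch below approximates the supremum on a grid and can be compared with this closed form; the grid size, the upper limit `u_max` and the function names are assumptions made for the example.

```python
def convex_conjugate(G, v, u_max=10.0, steps=100000):
    """Approximate H(v) = sup_{u >= 0} {u*v - G(u)} on a grid over [0, u_max]."""
    best = 0.0  # the value at u = 0, since G(0) = 0
    for k in range(1, steps + 1):
        u = u_max * k / steps
        best = max(best, u * v - G(u))
    return best


def quadratic_margin(c):
    """G(u) = c * u^2, whose convex conjugate is H(v) = v^2 / (4c)."""
    return lambda u: c * u * u
```

With a quadratic margin, the term $\delta H(2\lambda\Gamma/\delta)$ appearing in oracle bounds of this type becomes $\lambda^2\Gamma^2/(c\delta)$, i.e. of order $\lambda^2$ times the effective sparsity.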



For sets $S$ and vectors $\theta \in \mathbb{R}^p$ we let

$$\theta_{j,S} := \theta_j 1\{j \in S\}, \quad j = 1, \ldots, p.$$

Definition 2.2 (Effective sparsity) Let

$$\delta(L, S) := \min\{\tau(\theta) : \|\theta_S\|_1 = 1,\ \|\theta_{S^c}\|_1 \le L\}.$$

Then $\Gamma^2(L, S) := 1/\delta^2(L, S)$ is called the effective sparsity (of the set $S$).

Following van de Geer [2007] we call $\phi^2(L, S) := |S| \delta^2(L, S)$ the compatibility
constant (for the set $S$). If it is not too small, the norm $\tau$ and the $\ell_1$-norm
$\|\cdot\|_1$ are "compatible" with each other.

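We define for some constant $\lambda_0$, the set

$$\mathcal{T}_M(\theta^*) := \left\{ |Y(\theta, \theta^*)| \le \lambda_0 \|\theta - \theta^*\|_1 \vee \lambda_0^2, \ \forall\ \theta \in \Theta_M(\theta^*) \right\},$$

and let $\mathcal{T}(\theta^*) := \mathcal{T}_\infty(\theta^*)$ and $\Theta_\infty(\theta^*) = \Theta^*$.

Our task in Sections 3 and 4 is to show that with $\lambda_0 \asymp \sqrt{\log p / n}$, the set $\mathcal{T}_M(\theta^*)$
has large probability (for any $\theta^*$ and suitable $M$).

We first give in Theorem 2.1 a result where the margin assumption is assumed
to hold "globally". We then refine this in Theorem 2.2 to local conditions for
the convex case.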
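Theorem 2.1 Let $\lambda > \lambda_0$. Assume Condition 2.1 (the margin condition) for
all $\theta \in \Theta$. Let $H$ be the convex conjugate of $G$. If $\theta^0 \in \Theta$, we have on $\mathcal{T}(\theta^0)$,
for all $0 < \delta < 1$,

$$(1 - \delta)\mathcal{E}(\hat\theta; \theta^0) + (\lambda - \lambda_0)\|\hat\theta - \theta^0\|_1 \le \delta H\left( \frac{2\lambda\Gamma(L, S_0)}{\delta} \right) \vee 2\lambda^2, \qquad (3)$$

with $L = (\lambda + \lambda_0)/(\lambda - \lambda_0)$. Moreover, for all $0 < \delta < 1$ and all $\theta^* \in \Theta$, on
$\mathcal{T}(\theta^*)$,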

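$$(1 - \delta)\mathcal{E}(\hat\theta; \theta^0) + (\lambda - \lambda_0)\|\hat\theta - \theta^*\|_1 \le 2\delta H\left( \frac{4(1 + \delta)\lambda\Gamma(L_\delta, S_*)}{\delta} \right) \vee 2\lambda^2 + (1 + \delta)\mathcal{E}(\theta^*; \theta^0), \qquad (4)$$

with $L_\delta = 2((1 + \delta)/\delta)((\lambda + \lambda_0)/(\lambda - \lambda_0))$. Here $S_* := \{j : \theta^*_j \ne 0\}$ is the
support set of $\theta^*$.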

The proof of Theorem 2.1 is given in Section 7.

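Remark 2.1 With the above result, one can define the best sparse approximation
as a solution $\theta^*$ of the minimization

$$\min_{\theta \in \Theta} \left\{ 2\delta H\left( \frac{4(1 + \delta)\lambda\Gamma(L_\delta, S_\theta)}{\delta} \right) \vee 2\lambda^2 + (1 + \delta)\mathcal{E}(\theta; \theta^0) \right\},$$

where $S_\theta := \{j : \theta_j \ne 0\}$.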

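The next theorem assumes convexity and then needs the margin condition only
in a neighborhood of $\theta^*$.

Theorem 2.2 Let $\lambda > \lambda_0$. Let $\Theta^*$ be the smallest convex set containing $\Theta$ and suppose
that the map $\theta \mapsto \rho_\theta$, $\theta \in \Theta^*$ is convex. Let $(\lambda - \lambda_0)M_0$ and $(\lambda - \lambda_0)M_*$ be the
bounds given in the right hand side of (3) and (4) respectively, i.e.,

$$M_0 := \frac{1}{\lambda - \lambda_0} \left\{ \delta H\left( \frac{2\lambda\Gamma(L, S_0)}{\delta} \right) \vee 2\lambda^2 \right\},$$

with $L = (\lambda + \lambda_0)/(\lambda - \lambda_0)$, and

$$M_* := \frac{1}{\lambda - \lambda_0} \left\{ 2\delta H\left( \frac{4(1 + \delta)\lambda\Gamma(L_\delta, S_*)}{\delta} \right) \vee 2\lambda^2 + (1 + \delta)\mathcal{E}(\theta^*; \theta^0) \right\},$$

with $L_\delta = 2((1 + \delta)/\delta)((\lambda + \lambda_0)/(\lambda - \lambda_0))$. Here, $H$ is a strictly convex
increasing function with $H(0) = 0$.

If $\theta^0 \in \Theta$ and the margin condition holds for all $\theta \in \Theta_{2M_0}(\theta^0)$, with $G$ the
convex conjugate of $H$, then again on $\mathcal{T}_{2M_0}(\theta^0)$,

$$(1 - \delta)\mathcal{E}(\hat\theta; \theta^0) + (\lambda - \lambda_0)\|\hat\theta - \theta^0\|_1 \le (\lambda - \lambda_0)M_0.$$

For general $\theta^*$, if the margin condition holds for all $\theta \in \Theta_{2M_*}(\theta^*)$, with $G$ the
convex conjugate of $H$, then again on $\mathcal{T}_{2M_*}(\theta^*)$,

$$(1 - 2\delta)\mathcal{E}(\hat\theta; \theta^0) + (\lambda - \lambda_0)\|\hat\theta - \theta^*\|_1 \le (\lambda - \lambda_0)M_*.$$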
The proof of Theorem 2.2 is also in Section 7.



Remark 2.2 To handle the set $\mathcal{T}_M(\theta^*)$ we prove in Sections 3 and 4 that with
$\lambda_0 \asymp \sqrt{\log p / n}$, with large probability

$$\sup_{\theta \in \Theta_M(\theta^*)} |Y(\theta, \theta^*)| \le \lambda_0 M$$

for all $M \le$ const., and then apply the peeling device (the latter being detailed
in Subsection 3.4). However, as is clear from the proof of Theorem 2.2, one
can refrain from peeling in the convex case, because one already places oneself
in a suitable neighborhood of $\theta^*$ (see also van de Geer [2001] and van de Geer
[...]).

Remark 2.3 The models considered in Städler and van de Geer [2010] and
Schelldorfer et al. [2011] [...]



There, the margin condition holds 
in a bounded neighborhood and these bounds are imposed on the parameters. 
Then the peeling device is invoked. 



3 Symmetrization, contraction and deviation inequalities, and the peeling device



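We write the sample as $\mathbf{X} := (X_1, \ldots, X_n)$. Let $\varepsilon_1, \ldots, \varepsilon_n$ be a Rademacher
sequence independent of $\mathbf{X}$. For constants $\{c_{i,\theta}\}$ (which we will choose as in
Condition 1.1), we define

$$\rho^c_\theta(X_i, i) = \rho_\theta(X_i) - c_{i,\theta}, \quad i = 1, \ldots, n,$$

and the symmetrized empirical process

$$P_n^\varepsilon \rho^c_\theta := \frac{1}{n} \sum_{i=1}^n [\rho_\theta(X_i) - c_{i,\theta}]\varepsilon_i, \quad \theta \in \Theta^*,$$

and we let

$$Y^\varepsilon(\theta, \theta^*) := P_n^\varepsilon(\rho^c_\theta - \rho^c_{\theta^*}).$$

For a function $g : \mathcal{X} \times \{1, \ldots, n\} \to \mathbb{R}$, we use the notation

$$\|g\|_n^2 := \frac{1}{n} \sum_{i=1}^n g^2(X_i, i), \quad \|g\|^2 := \frac{1}{n} \sum_{i=1}^n E g^2(X_i, i).$$

In this section, we summarize the arguments that show that up to constants,
one can reduce the problem of deriving probability inequalities for the process
$Y(\theta, \theta^*)$ to studying the symmetrized process $Y^\varepsilon(\theta, \theta^*)$. In fact, we only need
bounds for the conditional expectation

$$E_n := E\left( \sup_{\theta \in \Theta_M(\theta^*)} |Y^\varepsilon(\theta, \theta^*)| \ \Big|\ \mathbf{X} \right).$$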
Alternatively, one can use direct arguments in certain regression problems (with
sub-Gaussian errors) or invoke for example Bernstein's inequality (but then
one has to adjust Sudakov's minoration argument to the case of independent
Gamma-distributed variables). We also discuss the peeling device (but as noted
in Remark 2.2 this device is not always needed).



3.1 Symmetrization 



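We cite the following result (see Pollard [1984]).

Lemma 3.1 Let $R := \sup_{\theta \in \Theta_M(\theta^*)} \|\rho^c_\theta - \rho^c_{\theta^*}\|$ and let $t \ge 4$. Then

$$P\left( \sup_{\theta \in \Theta_M(\theta^*)} |Y(\theta, \theta^*)| > 4R\sqrt{\frac{2t}{n}} \right) \le 4P\left( \sup_{\theta \in \Theta_M(\theta^*)} |Y^\varepsilon(\theta, \theta^*)| > R\sqrt{\frac{2t}{n}} \right).$$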



3.2 Contraction 



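Suppose that for all $\theta, \tilde\theta \in \Theta^*$,

$$|\rho^c_\theta(X_i, i) - \rho^c_{\tilde\theta}(X_i, i)| \le |f_\theta(X_i, i) - f_{\tilde\theta}(X_i, i)|, \quad \forall\ i,$$

for some functions $f_\theta : \mathcal{X} \times \{1, \ldots, n\} \to \mathbb{R}$, $\theta \in \Theta^*$.

By the contraction inequality of Ledoux and Talagrand [1991],

$$E_n := E\left( \sup_{\theta \in \Theta_M(\theta^*)} |Y^\varepsilon(\theta, \theta^*)| \ \Big|\ \mathbf{X} \right) \le 2E\left( \sup_{\theta \in \Theta_M(\theta^*)} |X^\varepsilon(\theta, \theta^*)| \ \Big|\ \mathbf{X} \right),$$

with

$$X^\varepsilon(\theta, \theta^*) := P_n^\varepsilon(f_\theta - f_{\theta^*}) := \frac{1}{n} \sum_{i=1}^n \varepsilon_i (f_\theta(X_i, i) - f_{\theta^*}(X_i, i)).$$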



i=l 



3.3 A deviation inequality 

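Write

$$R_n := \sup_{\theta \in \Theta_M(\theta^*)} \|\rho^c_\theta - \rho^c_{\theta^*}\|_n.$$

We have for all $t > 0$ (see Massart [2000a]),

$$P\left( \sup_{\theta \in \Theta_M(\theta^*)} |Y^\varepsilon(\theta, \theta^*)| \ge E_n + R_n\sqrt{\frac{2t}{n}} \ \Big|\ \mathbf{X} \right) \le \exp[-t].$$

Combining this with the symmetrization result of Section 3.1 we obtain the
following corollary.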
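Corollary 3.1 Let for some $R$

$$\sup_{\theta \in \Theta_M(\theta^*)} \|\rho^c_\theta - \rho^c_{\theta^*}\| \le R,$$

and let $t \ge 4$. Then for any $E$,

$$P\left( \sup_{\theta \in \Theta_M(\theta^*)} |Y(\theta, \theta^*)| > 4\left( E + R\sqrt{\frac{2t}{n}} \right) \right) \le 4\exp[-t] + 4P(R_n > R \vee E_n > E).$$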

As for the random variables $E_n$ and $R_n$, in our context we use Condition 1.1.
Consider first $R_n$. Condition 1.1 yields by the triangle inequality

$$\sup_{\theta \in \Theta_M(\theta^*)} \|\rho^c_\theta - \rho^c_{\theta^*}\|_n \le M K_n,$$

where

$$K_n := \max_{1 \le j \le p} \|\psi_j\|_n.$$

Thus, on the set

$$\mathcal{T}_0 := \left\{ \max_{1 \le j \le p} \|\psi_j\|_n \le K \right\}, \qquad (5)$$

(where $K$ is some constant) we can bound the random radii $R_n$ by $MK$. We
will see in Section 4 that a bound for the conditional expectation $E_n$ also only
involves $K_n$:

$$E_n \le \lambda_0 M K_n,$$

for some constant $\lambda_0 \asymp \sqrt{\log p / n}$.

In some cases (regression with fixed design) $K_n$ is not random, and the
assumption $\max_{1 \le j \le p} \|\psi_j\|_n \le K$ is a matter of normalization. In other situations, one
can for example apply Bernstein's inequality (Bennett [1962]):



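Lemma 3.2 Suppose that the $\psi_j(X_i)$ are uniformly sub-Gaussian, that is, for
some positive constants $L$ and $\tau$, it holds for all $j$,

$$\frac{1}{n} \sum_{i=1}^n L^2 \left( E\exp[\psi_j^2(X_i)/L^2] - 1 \right) \le \tau^2.$$

Then for all $t > 0$,

$$P\left( \max_{1 \le j \le p} \left| \|\psi_j\|_n^2 - \|\psi_j\|^2 \right| > 2\tau L\sqrt{\frac{2(t + \log p)}{n}} + \frac{2L^2(t + \log p)}{n} \right) \le 2\exp[-t].$$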



Proof. The sub-Gaussianity implies that for all $m \in \{1, 2, 3, \ldots\}$,



-t Tl ~t~ 2/71 I *^ 



n 



i=l 



i=l 



Eexp[^(Xi)/L 2 ]-l 



< ^ L 2(m-l) T 2 
~ 2 
But then



1 ™ 

i=l 



~ 2 



T . 



By Bernstein's inequality (Bennett [1962]), for all $t > 0$,

$$P\left( |(P_n - P)\psi_j^2| > 2\tau L\sqrt{\frac{2t}{n}} + \frac{2L^2 t}{n} \right) \le 2\exp[-t],$$

and hence, by the union bound, for all $t > 0$,

$$P\left( \max_{1 \le j \le p} |(P_n - P)\psi_j^2| > 2\tau L\sqrt{\frac{2(t + \log p)}{n}} + \frac{2L^2(t + \log p)}{n} \right) \le 2\exp[-t].$$



□ 



The assumption of sub-Gaussianity is not a necessary condition. One may
replace it by an $m$-th order moment condition, with $p^2/n^m$ sufficiently small.
However, we then will no longer have exponential probability inequalities.



3.4 The peeling device 



The peeling device goes back to Alexander [1985], the terminology being
introduced in van de Geer [2000]. In the present context we can use it in the
following form.

We show in the next section that under certain conditions

$$E_n \le \lambda_0 M K_n, \qquad (6)$$

where $\lambda_0 \asymp \sqrt{\log p / n}$, and $K_n := \max_{1 \le j \le p} \|\psi_j\|_n$. Then, under Condition 1.1
and the sub-Gaussianity assumption of Lemma 3.2 we have for all $M$, and all
$t > 0$,



pf sup \Y(9,e*)\ > — ( I + K 

We M (0*) e 



t t 
+ 



log p n 



<6exp[-t]. (7) 



with $K_*$ depending on $L$ and $\tau$ but not on $n$ and $p$, and $\lambda_* \asymp \sqrt{\log p / n}$. This
follows from (6) and from Subsection 3.3. The constant 6 in the right hand side
of inequality (7) comes from a 4 from the symmetrization plus a 2 from Lemma
3.2 (we actually may replace 2 by 1 here because we only need a one-sided
version).

Once (7) is established, we can invoke the peeling device as follows. Let $M$ be
fixed, and let $M_j := e^{-j} M$, $j = 0, \ldots, p$. Then for all $t > 0$,



P ( sup 



\Y(6,6*)\ 



L Ve"(P" 1 )M 



> A* 1 + if* 



'£ + logp Jt _ t + l°gP 



logp 



n 



<Vpf sup \Y(6,6*)\>\*Mj[l + K„ 

j=1 KeeeM.^ie*) 



' t + log p t + log p 



logp 



n 



< 6exp[logp— (logp + t)} < 6exp[— t]. 






4 Bounds for the symmetrized process 



In the previous section we argued that the main task is to establish bounds for
the expectation of the symmetrized process, i.e., for

$$E\left( \sup_{\theta \in \Theta_M(\theta^*)} |Y^\varepsilon(\theta, \theta^*)| \ \Big|\ \mathbf{X} \right).$$

We are looking for bounds of the type (6). One can then derive deviation
inequalities as shown in Section 3 and hence (as shown in Theorems 2.1 and 2.2)
theoretical bounds for the tuning parameter of the $\ell_1$-regularized M-estimator.



4.1 Linear functions 



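Let us briefly recall the linear case. Let $[\rho_\theta(X_i) - c_{i,\theta}] - [\rho_{\tilde\theta}(X_i) - c_{i,\tilde\theta}]$ be linear:

$$[\rho_\theta(X_i) - c_{i,\theta}] - [\rho_{\tilde\theta}(X_i) - c_{i,\tilde\theta}] = (\theta - \tilde\theta)^T \psi(X_i, i), \quad i = 1, \ldots, n.$$

One then clearly has

$$E\left( \sup_{\theta \in \Theta_M(\theta^*)} |Y^\varepsilon(\theta, \theta^*)| \ \Big|\ \mathbf{X} \right) \le M\, E\left( \|\varepsilon^T\psi/n\|_\infty \ \Big|\ \mathbf{X} \right).$$

Moreover, by Hoeffding's inequality (Hoeffding [1963], see also Lemma 14.14 in
Bühlmann and van de Geer [2011])

$$E\left( \|\varepsilon^T\psi/n\|_\infty \ \Big|\ \mathbf{X} \right) \le \sqrt{\frac{2\log(2p)}{n}}\, K_n,$$

where $K_n := \max_{1 \le j \le p} \|\psi_j\|_n$.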
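The bound for the linear case can be illustrated by simulation, conditionally on a fixed design. The sketch below draws Rademacher signs, estimates $E(\|\varepsilon^T\psi/n\|_\infty \mid \mathbf{X})$ by Monte Carlo and evaluates $\sqrt{2\log(2p)/n}\,K_n$; storing $\psi$ as a list of $p$ columns, as well as the function names, repetitions and seed, are assumptions made for this example.

```python
import math
import random


def empirical_norm(col):
    """||psi_j||_n = sqrt((1/n) * sum_i psi_j(X_i, i)^2)."""
    return math.sqrt(sum(x * x for x in col) / len(col))


def mean_sup_rademacher(psi_cols, reps=500, seed=1):
    """Monte Carlo estimate of E( ||eps^T psi / n||_inf | X ) for fixed columns."""
    rng = random.Random(seed)
    n = len(psi_cols[0])
    total = 0.0
    for _ in range(reps):
        eps = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(abs(sum(e * x for e, x in zip(eps, col))) / n for col in psi_cols)
    return total / reps


def hoeffding_bound(psi_cols):
    """sqrt(2 log(2p) / n) * K_n with K_n = max_j ||psi_j||_n."""
    n, p = len(psi_cols[0]), len(psi_cols)
    k_n = max(empirical_norm(col) for col in psi_cols)
    return math.sqrt(2.0 * math.log(2 * p) / n) * k_n
```

Multiplying `hoeffding_bound(psi_cols)` by the radius $M$ gives the bound on the conditional expectation of the supremum over the $\ell_1$-ball of radius $M$.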



4.2 Generalized linear functions 



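Suppose that for all $\theta, \tilde\theta \in \Theta^*$,

$$\left| [\rho_\theta(X_i) - E\rho_\theta(X_i)] - [\rho_{\tilde\theta}(X_i) - E\rho_{\tilde\theta}(X_i)] \right| \le |f_\theta(X_i, i) - f_{\tilde\theta}(X_i, i)|, \quad \forall\ i,$$

where $f_\theta(X_i, i) = \sum_{j=1}^p \theta_j \psi_j(X_i, i)$, $\theta \in \Theta^*$. Then by the contraction inequality
of Subsection 3.2 and the arguments of Subsection 4.1 for the linear case

$$E\left( \sup_{\theta \in \Theta_M(\theta^*)} |Y^\varepsilon(\theta, \theta^*)| \ \Big|\ \mathbf{X} \right) \le 2M\sqrt{\frac{2\log(2p)}{n}}\, K_n,$$

with $K_n := \max_{1 \le j \le p} \|\psi_j\|_n$.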
4.3 Extended generalized linear functions



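Condition 4.1 (Extended GLM condition) There exist non-negative functions
$\{\psi_{j,k} : j = 1, \ldots, p_k,\ k = 1, \ldots, r\}$ (with $\sum_{k=1}^r p_k = p$) such that for all $\theta$ and $\tilde\theta$
in $\Theta^*$, it holds that

$$\left| [\rho_\theta(X_i) - c_{i,\theta}] - [\rho_{\tilde\theta}(X_i) - c_{i,\tilde\theta}] \right| \le \sum_{k=1}^r \left| \sum_{j=1}^{p_k} (\theta_{j,k} - \tilde\theta_{j,k}) \psi_{j,k}(X_i, i) \right|, \quad i = 1, \ldots, n.$$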

k=l j=l 



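Theorem 4.1 (Multivariate contraction theorem) Assume Condition 4.1. Let
$\xi_{1,k}, \ldots, \xi_{n,k}$, $k = 1, \ldots, r$, be independent $\mathcal{N}(0, 1)$-distributed random variables,
independent of $X_1, \ldots, X_n$. Let

$$X_k(\theta, \theta^*) := \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^{p_k} (\theta_{j,k} - \theta^*_{j,k}) \psi_{j,k}(X_i, i) \xi_{i,k},$$

and

$$X(\theta, \theta^*) := \sum_{k=1}^r X_k(\theta, \theta^*) = \frac{1}{n} \sum_{i=1}^n \sum_{k=1}^r \sum_{j=1}^{p_k} (\theta_{j,k} - \theta^*_{j,k}) \psi_{j,k}(X_i, i) \xi_{i,k}.$$

Then for a universal constant $C$,

$$E\left( \sup_{\theta \in \Theta_M(\theta^*)} |Y^\varepsilon(\theta, \theta^*)| \ \Big|\ \mathbf{X} \right) \le C 2^{r-1} E\left( \sup_{\theta \in \Theta_M(\theta^*)} X(\theta, \theta^*) \ \Big|\ \mathbf{X} \right).$$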



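Proof. We apply Theorem 2.1.1 in Talagrand [2005], cited in the present paper
as Theorem 5.1. Note first that

$$E\left( |X(\theta, \theta^*) - X(\tilde\theta, \theta^*)|^2 \ \Big|\ \mathbf{X} \right) = \frac{1}{n} \sum_{k=1}^r \left\| \sum_{j=1}^{p_k} (\theta_{j,k} - \tilde\theta_{j,k}) \psi_{j,k} \right\|_n^2.$$

For all $\theta$ and $\tilde\theta$ we have

$$\|\rho^c_\theta - \rho^c_{\tilde\theta}\|_n^2 \le \left\| \sum_{k=1}^r \left| \sum_{j=1}^{p_k} (\theta_{j,k} - \tilde\theta_{j,k}) \psi_{j,k} \right| \right\|_n^2 \le 2^{r-1} \sum_{k=1}^r \left\| \sum_{j=1}^{p_k} (\theta_{j,k} - \tilde\theta_{j,k}) \psi_{j,k} \right\|_n^2.$$

By Hoeffding's inequality (Hoeffding [1963])

$$P\left( |Y^\varepsilon(\theta, \theta^*) - Y^\varepsilon(\tilde\theta, \theta^*)| \ge \|\rho^c_\theta - \rho^c_{\tilde\theta}\|_n \sqrt{\frac{2t}{n}} \ \Big|\ \mathbf{X} \right) \le 2\exp[-t].$$

Hence, using Theorem 2.1.5 in Talagrand's book (Talagrand [2005]) (see Section
5, Theorem 5.2), we get for a universal constant $C$,

$$E\left( \sup_{\theta \in \Theta_M(\theta^*)} |Y^\varepsilon(\theta, \theta^*)| \ \Big|\ \mathbf{X} \right) \le C 2^{r-1} E\left( \sup_{\theta \in \Theta_M(\theta^*)} X(\theta, \theta^*) \ \Big|\ \mathbf{X} \right).$$

□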



As a direct consequence (i.e., by bounding the right hand side in Theorem 4.1),
we obtain the bounds of interest for our problem.

Theorem 4.2 Assume Condition 4.1 and let $K_n := \max_{j,k} \|\psi_{j,k}\|_n$. We have
for a universal constant $C$,

$$E\left( \sup_{\theta \in \Theta_M(\theta^*)} |Y^\varepsilon(\theta, \theta^*)| \ \Big|\ \mathbf{X} \right) \le C 2^{r-1} M \sqrt{\frac{2\log(2p)}{n}}\, K_n.$$

Proof. Let $X(\theta, \theta^*)$ be defined as in Theorem 4.1. As in Subsection 4.1,
but now for Gaussians instead of a Rademacher sequence, conditionally on
$\mathbf{X} := (X_1, \ldots, X_n)$, we have

$$E\left( \sup_{\theta \in \Theta_M(\theta^*)} X(\theta, \theta^*) \ \Big|\ \mathbf{X} \right) \le M \sqrt{\frac{2\log(2p)}{n}}\, K_n.$$

□



4.4 Non-linear functions 



We now consider the case where the loss $\rho_\theta$ is possibly not extended GLM, that
is, its dependence on $\theta$ is strictly non-linear. However, we do assume that it is
component-wise Lipschitz in $\theta$, i.e., that Condition 1.1 holds.

Define for $\psi = (\psi_1, \ldots, \psi_p)^T$,

$$\Sigma_n := \frac{1}{n} \sum_{i=1}^n \psi(X_i, i)\psi^T(X_i, i).$$

Let $\Lambda_n^2$ be the smallest eigenvalue of $\Sigma_n$ and $\bar\Lambda_n^2$ be its largest eigenvalue. We
assume that $\Lambda_n > 0$, thus excluding the case $p > n$.



Theorem 4.3 Assume Condition 1.1. For a universal constant $C$, it holds
that



E 



sup \Y £ {9,9*) 
e&e M (9*) 



X < CM 



2 log 



n \ 



/A, 



Proof. Use that 



and 



5>J^-[ln > ¥ n \\9\\l 

i=i 



EiWiii» <A2 " lfll " 2 



i=i 



Then apply the same arguments as in Theorem 4.2.



□ 
5 The geometry of $\ell_1$-balls

We first describe here the generic chaining bound, specialized to our context
and with a notation adjusted to our setting. Let $\xi_1, \ldots, \xi_n$ be independent
$\mathcal{N}(0, 1)$-distributed random variables and $V$ be a subset of $\mathbb{R}^n$. Define

$$X_v := \frac{1}{n} \sum_{i=1}^n v_i \xi_i, \quad v \in V.$$

Moreover, write

$$\|v\|_n^2 := \frac{1}{n} \sum_{i=1}^n v_i^2, \quad v \in V.$$

Talagrand (Talagrand [2005], Definition 1.2.3) calls a sequence of partitions
$\{\mathcal{A}_s\}_{s \ge 0}$ of $V$ admissible if it is an increasing sequence (i.e., each set of $\mathcal{A}_{s+1}$ is contained
in a set of $\mathcal{A}_s$ for all $s$), and $|\mathcal{A}_s| \le 2^{2^s}$ for all $s$. He defines for each $v \in V$ and each $s$,
the set $A_s(v)$ as the unique element of $\mathcal{A}_s$ that contains $v$, and $\Delta(A_s(v))$ as the
diameter of $A_s(v)$. He writes

$$\gamma_2(V, \|\cdot\|_n) := \inf \sup_{v \in V} \sum_{s \ge 0} 2^{s/2} \Delta(A_s(v)),$$

where the infimum is taken over all admissible partitions.

Theorem 5.1 (The majorizing measure theorem, see Talagrand [2005],
Theorem 2.1.1) For some universal constant $C$, we have

$$\frac{1}{C} \frac{\gamma_2(V, \|\cdot\|_n)}{\sqrt{n}} \le E\left( \sup_{v \in V} X_v \right) \le C \frac{\gamma_2(V, \|\cdot\|_n)}{\sqrt{n}}.$$

Talagrand derives the lower bound in the above theorem from Sudakov's
minoration argument. As a consequence, Talagrand presents the following result.

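Theorem 5.2 (Talagrand [2005], Theorem 2.1.5) Let $\{Y_v : v \in V\}$ be a
stochastic process that satisfies for all $t > 0$

$$P\left( |Y_v - Y_{\tilde v}| \ge \sqrt{t} \right) \le 2\exp\left[ -\frac{nt}{2\|v - \tilde v\|_n^2} \right], \quad \forall\ v, \tilde v \in V.$$

Then for a universal constant $C$, we have

$$E\left( \sup_{v, \tilde v \in V} |Y_v - Y_{\tilde v}| \right) \le C E\left( \sup_{v, \tilde v \in V} |X_v - X_{\tilde v}| \right).$$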



Let us compare here the situation with Dudley's entropy bound (Dudley [1967]).
We formulate it using chaining along a tree, as in Bühlmann and van de Geer
[2011], Subsection 14.12.4, or van de Geer and Lederer [2011]. Define
$R_n := \sup_{v \in V} \|v\|_n$. Let for each $s \in \{0, 1, \ldots, S\}$, $\{v_j^s\}_{j=1}^{N_s} \subset V$ be a minimal $2^{-s}R_n$-covering
set of $V$, that is, for all $v \in V$ and all $s$ there is a $v_j^s$ such that
$\|v - v_j^s\|_n \le 2^{-s}R_n$. Then for all $v$, we can find an end node $v^S \in \{v_j^S\}$ such
that $\|v - v^S\|_n \le 2^{-S}R_n$, and for each end node $v^S \in \{v_j^S\}$ one can find a branch
$\{v^0, \ldots, v^S\}$ such that $\|v^s - v^{s-1}\|_n \le 2^{-(s-1)}R_n$ for all $s = 1, \ldots, S$. Moreover,
we can write (with $X_{v^0} = 0$)

$$X_v = \sum_{s=1}^S (X_{v^s} - X_{v^{s-1}}) + X_v - X_{v^S}.$$

Invoking

$$|X_v - X_{v^S}| \le 2^{-S}R_n \sqrt{\frac{1}{n}\sum_{i=1}^n \xi_i^2},$$

one arrives at Dudley's bound

$$E\left( \sup_{v \in V} X_v \right) \le \sum_{s=0}^S 2^{-(s-1)}R_n \sqrt{\frac{2\log(2N_s)}{n}} + 2^{-S}R_n. \qquad (8)$$



Consider now a special case. We let $\psi_1, \ldots, \psi_p$ be $p$ vectors in $\mathbb{R}^n$, and let

$$V := \left\{ \sum_{j=1}^p \theta_j \psi_j : \|\theta\|_1 \le 1 \right\}.$$

Let $K_n := \max_{1 \le j \le p} \|\psi_j\|_n$.

The following lemma rephrases the first part of Theorem 2.1.6 in Talagrand
[2005]. We present a short proof to show that it is again based on the dual
norm inequality.

Lemma 5.1 It holds for some universal constant $C$ that

$$\gamma_2(V, \|\cdot\|_n) \le C \sqrt{2\log(2p)}\, K_n.$$
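Indirect Proof. Clearly, by the dual norm inequality

$$\sup_{v \in V} X_v = \sup_{\|\theta\|_1 \le 1} \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^p \theta_j \psi_{i,j} \xi_i = \max_{1 \le j \le p} \left| \frac{1}{n} \sum_{i=1}^n \psi_{i,j} \xi_i \right|.$$

Hence,

$$E\left( \sup_{v \in V} X_v \right) = E \max_{1 \le j \le p} \left| \frac{1}{n} \sum_{i=1}^n \psi_{i,j} \xi_i \right| \le \sqrt{\frac{2\log(2p)}{n}}\, K_n.$$

The result now follows from Theorem 5.1.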



□ 



In his book, Talagrand now poses the research question to prove Lemma 5.1
directly (Talagrand [2005], Research problem 2.1.9). We claim that this cannot
be done by applying Dudley's bound. Our reasoning is as follows. Using
Theorem 6.2 in Pollard [1990] (see also van der Vaart and Wellner [1996], Lemma
2.6.11, or Bühlmann and van de Geer [2011], Lemma 14.28), we see that

$$\log(2N_s) \le 2^{2s} \log(4p), \quad \forall\ s. \qquad (9)$$

Insert this in (8) with the bound $R_n \le K_n$, to find that

$$E\left( \sup_{v \in V} X_v \right) \le 2(S + 1)K_n \sqrt{\frac{2\log(4p)}{n}} + 2^{-S}K_n.$$

Minimizing this over $S$ gives a bound of order $(\log n)\sqrt{\log p / n}\, K_n$. In other
words (assuming the entropy bound (9) is up to constants tight, which we
believe it is) invoking Dudley's bound instead of generic chaining leads to a
redundant $(\log n)$-factor. Apparently, Dudley's bound does not fully capture
the geometry of $\ell_1$-balls.
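The extra logarithmic factor can be made visible numerically. The sketch below evaluates the bound $2(S+1)K_n\sqrt{2\log(4p)/n} + 2^{-S}K_n$, minimizes it over integer chain depths $S$, and provides the rate $\sqrt{2\log(2p)/n}\,K_n$ from the dual norm argument for comparison. The function names and the search range `S_max` are illustrative assumptions.

```python
import math


def dudley_bound(S, n, p, K=1.0):
    """2(S+1) K sqrt(2 log(4p)/n) + 2^{-S} K for chain depth S."""
    return 2.0 * (S + 1) * K * math.sqrt(2.0 * math.log(4 * p) / n) + 2.0 ** (-S) * K


def best_depth(n, p, K=1.0, S_max=60):
    """Integer S in {0, ..., S_max} minimizing dudley_bound."""
    return min(range(S_max + 1), key=lambda S: dudley_bound(S, n, p, K))


def chaining_rate(n, p, K=1.0):
    """Reference rate sqrt(2 log(2p)/n) K from the dual norm argument."""
    return math.sqrt(2.0 * math.log(2 * p) / n) * K
```

For fixed $p$, the optimal depth returned by `best_depth` grows roughly like $\tfrac{1}{2}\log_2 n$, so the ratio `dudley_bound(best_depth(n, p), n, p) / chaining_rate(n, p)` increases with $n$, in line with the redundant $(\log n)$-factor.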



6 Concluding remarks 

This paper combines results in the literature concerning symmetrization,
contraction, deviation inequalities and chaining. Their application in statistical
theory has been highlighted by Massart [2000b]. We have added now a new
application, where generic chaining allows one to remove additional $\log n$
factors. For example, we have improved the choice $\lambda \asymp \sqrt{\log^3 n \, \log(p \vee n)/n}$ in
Städler and van de Geer [2010] to $\lambda \asymp \sqrt{\log p / n}$. The geometric arguments
to bound $\gamma_2$ in the case of convex hulls are still to be developed. Somehow,
the generic chaining bound $\gamma_2$ better exploits the impossibility to play cat and
mouse.



7 Proofs of Theorems 2.1 and 2.2

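Proof of Theorem 2.1 The Basic Inequality says that

$$\mathcal{E}(\hat\theta; \theta^0) + \lambda\|\hat\theta\|_1 \le -Y(\hat\theta, \theta^*) + \lambda\|\theta^*\|_1 + \mathcal{E}(\theta^*; \theta^0).$$

Hence on $\mathcal{T}(\theta^*)$,

$$\mathcal{E}(\hat\theta; \theta^0) + \lambda\|\hat\theta\|_1 \le \lambda_0\|\hat\theta - \theta^*\|_1 \vee \lambda_0^2 + \lambda\|\theta^*\|_1 + \mathcal{E}(\theta^*; \theta^0).$$

If $\|\hat\theta - \theta^*\|_1 \le \lambda_0$, we get

$$\mathcal{E}(\hat\theta; \theta^0) + (\lambda - \lambda_0)\|\hat\theta - \theta^*\|_1 \le 2\lambda^2 + \mathcal{E}(\theta^*; \theta^0).$$

Hence in the rest of the proof, we can assume $\|\hat\theta - \theta^*\|_1 > \lambda_0$.

For $\theta^* = \theta^0$, we get

$$\mathcal{E}(\hat\theta; \theta^0) + (\lambda - \lambda_0)\|\hat\theta_{S_0^c}\|_1 \le (\lambda + \lambda_0)\|\hat\theta_{S_0} - \theta^0\|_1,$$

which gives for any $0 < \delta < 1$,

$$\mathcal{E}(\hat\theta; \theta^0) + (\lambda - \lambda_0)\|\hat\theta - \theta^0\|_1 \le 2\lambda\|\hat\theta_{S_0} - \theta^0\|_1 \le 2\lambda\Gamma(L, S_0)\tau(\hat\theta - \theta^0) \le \delta\mathcal{E}(\hat\theta; \theta^0) + \delta H\left( \frac{2\lambda\Gamma(L, S_0)}{\delta} \right).$$

For general $\theta^*$, we get

$$\mathcal{E}(\hat\theta; \theta^0) + (\lambda - \lambda_0)\|\hat\theta_{S_*^c}\|_1 \le (\lambda + \lambda_0)\|\hat\theta_{S_*} - \theta^*\|_1 + \mathcal{E}(\theta^*; \theta^0).$$

If $(\lambda + \lambda_0)\|\hat\theta_{S_*} - \theta^*\|_1 \le \delta\mathcal{E}(\theta^*; \theta^0)$, we obtain

$$\mathcal{E}(\hat\theta; \theta^0) + (\lambda - \lambda_0)\|\hat\theta_{S_*^c}\|_1 \le (1 + \delta)\mathcal{E}(\theta^*; \theta^0).$$

And then, using $\lambda - \lambda_0 < \lambda + \lambda_0$,

$$\mathcal{E}(\hat\theta; \theta^0) + (\lambda - \lambda_0)\|\hat\theta - \theta^*\|_1 \le (1 + 2\delta)\mathcal{E}(\theta^*; \theta^0).$$

If $(\lambda + \lambda_0)\|\hat\theta_{S_*} - \theta^*\|_1 > \delta\mathcal{E}(\theta^*; \theta^0)$, we obtain

$$\mathcal{E}(\hat\theta; \theta^0) + (\lambda - \lambda_0)\|\hat\theta_{S_*^c}\|_1 \le \frac{1 + \delta}{\delta}(\lambda + \lambda_0)\|\hat\theta_{S_*} - \theta^*\|_1,$$

and hence

$$\mathcal{E}(\hat\theta; \theta^0) + (\lambda - \lambda_0)\|\hat\theta - \theta^*\|_1 \le \frac{1 + 2\delta}{\delta}(\lambda + \lambda_0)\|\hat\theta_{S_*} - \theta^*\|_1$$

$$\le \frac{1 + 2\delta}{\delta}(\lambda + \lambda_0)\Gamma(L_\delta, S_*)\tau(\hat\theta - \theta^*) + \mathcal{E}(\theta^*; \theta^0)$$

$$\le 4\delta H\left( \frac{(1 + 2\delta)(\lambda + \lambda_0)\Gamma(L_\delta, S_*)}{\delta} \right) + \delta\mathcal{E}(\hat\theta; \theta^0) + (1 + \delta)\mathcal{E}(\theta^*; \theta^0).$$

It follows that

$$(1 - 2\delta)\mathcal{E}(\hat\theta; \theta^0) + (\lambda - \lambda_0)\|\hat\theta - \theta^*\|_1 \le 4\delta H\left( \frac{(1 + 2\delta)(\lambda + \lambda_0)\Gamma(L_\delta, S_*)}{\delta} \right) + (1 + 2\delta)\mathcal{E}(\theta^*; \theta^0).$$

Finally simplify the expression using $\lambda + \lambda_0 \le 2\lambda$, and replacing $2\delta$ by $\delta$.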



□ 



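Proof of Theorem 2.2 We only describe the case $\theta^* = \theta^0$, the case $\theta^* \ne \theta^0$
following by the same arguments. Repeat the proof of Theorem 2.1 with $\hat\theta$
replaced by $\tilde\theta := t\hat\theta + (1 - t)\theta^0$, where

$$t := \frac{2M_0}{2M_0 + \|\hat\theta - \theta^0\|_1}.$$

Note that $\|\tilde\theta - \theta^0\|_1 \le 2M_0$. By the proof of Theorem 2.1 we obtain that
actually $\|\tilde\theta - \theta^0\|_1 \le M_0$ on $\mathcal{T}_{2M_0}(\theta^0)$. But this implies $\|\hat\theta - \theta^0\|_1 \le 2M_0$. Now,
repeat the proof again, knowing that $\|\hat\theta - \theta^0\|_1 \le 2M_0$.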



□ 